Introduction and data info

This document shows a worked example of analyzing 24-hour accelerometry data, using a subset of the open-access Long-term Movement Monitoring Database. I’ve already summarized the raw 100 Hz acceleration data and summarized it in one-minute epochs as “MIMS units” via the MIMSunit R package (link here). MIMS units are very similar to “counts” available from Actigraph devices, but they have some technical and practical advantages. Check out the paper for more on MIMS units.

The code and data accompanying this demo can be found at GitHub here.

I made this demo to accompany my talk for the 2023 Mathematical Sciences in Obesity Research short course. This is an educational analysis meant to be quick and feasible on a lower-end laptop, which put some hard constraints on how much and how complex of a dataset we can work with. It also skips over some of the finer points that really do matter in research studies, like what to do about non-wear-time and other sources of missingness in your data. Also, in part because of the small size of the dataset, we don’t get particularly interesting or insightful results. Repeat this analysis on the entire NHANES cohort and it’d probably be a different story.

The present dataset consists of waist-worn accelerometer data from one 24-hour day (midnight to midnight) from 50 older adults. These older adults were classified as “fallers” (people who had a history of falls) and “controls” (no history of falls). This is a subset of the larger LTMMD; I preprocessed only subjects with a reasonable amount of wear time in their first 24-hour day of device wear. The full dataset is larger (~80 subjects) and has multiple days of data from each subject.

Main goals

This educational demo is intended to showcase a few aspects of analyzing 24-hour accelerometer data:

Read in data

As noted above, the data are already preprocessed into one-minute epochs. This substantially reduces the size of the data we need to download. Notice how for each subject we’re calculating a total_MIMS variable that is just a sum of all of their accumulated MIMS units throughout the day.

library(tidyverse)
library(viridis) #pretty colors
library(mgcv) #for semiparametric regression
library(refund) #for functional regression
library(reshape2) #just for melt
library(gratia) #Viewing mgcv model results

# --- Load in our files and main dataframe ----
main_df <- read_csv("data/main_df.csv", show_col_types = FALSE)
all_files <- list.files("data/day_mims/", pattern="*.csv")

#Preallocate
df_list <- list()
day_list <- list()

subject_df_list <- list()
Y <- matrix(data=NA,nrow=length(all_files),ncol=1440) #1440 minutes in a day

for (i in 1:length(all_files)){
  f <- all_files[i]
  this_subject <- gsub("\\_mims.csv$", "", f)
  this_df <- read_csv(paste("data/day_mims/", f, sep=''), show_col_types = FALSE) %>%
    mutate(hours = seq(0,24,length.out=1440),
           sub_code = this_subject)
  
  #Total activity
  this_subject_df <- main_df %>% filter(sub_code == this_subject) %>% 
    mutate(total_mims = sum(this_df$MIMS_UNIT))
  
  day_list[[i]] <- this_df
  Y[i,] <- this_df$MIMS_UNIT #Drop in 24hr activity data
  df_list[[i]] <- this_subject_df #Slightly untidy but works
}

scalar_df <- bind_rows(df_list)
day_df <- bind_rows(day_list)

Examine one day

Here’s a pretty typical day from the dataset:

ix <- 2 #Which subject? 

hour_seq <- seq(0, 24, by = 3)
formatted_hours <- sprintf("%02d:00", hour_seq)

ggplot(mapping = aes(x=seq(0,24,length.out=1440), 
                     y = Y[ix,])) + 
  geom_ribbon(aes(ymax = Y[ix,], ymin=0), fill = "navy", alpha = 0.4) + 
  geom_line(linewidth = 0.5, color = "navy") + 
  ggtitle("24hrs of activity data, 1min summary") + 
  scale_x_continuous(limits=c(0,24), breaks = hour_seq, expand = c(0.01,0),
                     labels = formatted_hours,
                     name = 'Time of day') +
  scale_y_continuous(name = 'Activity level (MIMS units)') + 
  theme(plot.title = element_text(hjust=0.5))

And here’s data from several subjects:

plt_subs <- day_df$sub_code %>% unique() %>% head(9)

day_df %>%
  filter(sub_code %in% plt_subs) %>%
  ggplot(aes(x=hours, y=MIMS_UNIT, group = sub_code)) + 
  geom_ribbon(aes(ymax = MIMS_UNIT, ymin=0), fill = "navy", alpha = 0.4) + 
  geom_line(linewidth = 0.25, color="navy") + 
  facet_wrap(~sub_code, ncol=3)

It’s also worth looking at the histogram of MIMS units throughout the day:

ggplot(mapping = aes(x=Y[ix,])) + 
  geom_histogram(aes(y = ..density..), colour="black", fill="navy", 
                 alpha = 0.2, bins = 25) + 
  geom_density(color = "navy", size=1) + 
  ggtitle("Distribution of activity level during the day") + 
  scale_x_continuous(name = "Activity level (MIMS units)")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

It’s very right-skewed! This can be modestly improved with a log(x+1) transform (+1 because there are zero values).

ggplot(mapping = aes(x=log(Y[ix,] + 1))) + 
  geom_histogram(aes(y = ..density..), colour="black", fill="navy", 
                 alpha = 0.2, bins = 25) + 
  geom_density(color = "navy", size=1) + 
  ggtitle("Distribution of activity level during the day") + 
  scale_x_continuous(name = "Log(activity + 1)") 

Aggregating daily activity

A simple way to summarize someone’s activity level is to just add up their activity across the day, as we did above. Here’s the distribution of total MIMS units, at the subject level:

scalar_df %>%
  ggplot(aes(x=total_mims)) + 
  geom_histogram(aes(y = ..density..), colour="black", fill="navy", 
                 alpha = 0.2, bins = 15) + 
  geom_density(color = "navy", size=1) + 
  ggtitle("Sum of daily activity (N=50 subjects)") + 
  scale_x_continuous(name = "Total MIMS units") 

Again, the situation improves a bit with a log transform (no plus one here since there are no zeros).

scalar_df %>%
  ggplot(aes(x=log(total_mims))) + 
  geom_histogram(aes(y = ..density..), colour="black", fill="navy", 
                 alpha = 0.2, bins = 15) + 
  geom_density(color = "navy", size=1) + 
  ggtitle("Sum of daily activity (N=50 subjects)") + 
  scale_x_continuous(name = "log(Total MIMS units)") 

From a modeling situation either modeling total_mims or log(total_mims) is likely justifiable. Here’s a simple model testing to see if there is a (smooth, possibly nonlinear) association between age and total activity level.

#k is number of knots for the spline - with REML penalization, we just need "enough"
# - too many is not really a big concern
k <- 7 # an approximate rule for k is sqrt() of number of unique values, here 49
age_model <- gam(total_mims ~ s(age, bs="cr", k=k),
                 method = "REML",
                 data = scalar_df)

summary(age_model)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## total_mims ~ s(age, bs = "cr", k = k)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2840.6      152.7    18.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##        edf Ref.df     F p-value  
## s(age)   1      1 4.067  0.0493 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0589   Deviance explained = 7.81%
## -REML = 405.99  Scale est. = 1.1664e+06  n = 50
gratia::draw(age_model)

So, we have weak evidence that people get less active with age (not surprising). In this case it does seem like a linear fit would be sufficient, as evidenced in part by the edf of 1 for the smooth term for age. With larger datasets this will not generally be the case, though. We could’ve fit to log(total_mims); it improves the R^2 value a bit, but doesn’t make a huge difference.

In any case, here’s a linear fit to the data:

scalar_df %>% 
  ggplot(aes(x=age, y=total_mims)) + 
  geom_point() + 
  geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'

Analyzing the 24-hour activity cycle

A more sophisticated analysis might involve looking at how 24-hour activity changes with age. This, too, can be modeled; we need a functional data model to do so. Below, I use the regression framework introduced by Sonja Greven and Fabian Scheipl in the refund package, which extends the scalar GAM from above to work with functional data. Here, our “function” of interest is the 24-hour activity curve that we plotted earlier. Each subject’s 24-hour day is one observation, and we can study how that observation changes as a function of covariates like age or whether this subject has a history of falls.

Greven and Schiepl’s framework is very clever: it reframes functional regression as scalar regression, with the residuals being i.i.d. conditional on the model. That just means you need to model all of the sources of non-iid variation in the data, and your model will be correctly specified. The biggest change from the scalar model is the introduction of smooth residuals, \(E_i(t)\), which are modeled as random functional effects.

The interface is pretty similar to MGCV. Here’s code that fits a functional version of the same model from above, to study how the 24hr activity changes as a function of age.

For simplicity, we consider only a constant linear effect of age (c(age)), though remember on the log scale this implies multiplicative effects.

#Warning - takes a while to run!

#for now
scalar_df$age5_centered <- (scalar_df$age - 75)/5
#Standardize to 75 yr old, 5yr change = one unit change for \beta (t)
scalar_df$sub_code <- factor(scalar_df$sub_code) #mgcv requires factors for bs="re"

Y_lp1 <- log(Y + 1) # Log(x+1) transform, better for Gaussian link function

y_ix <- seq(0,24, length.out = 1440) #Time index, hours (0-24)
k_int <- 24 #knots for global intercept - try 16-24
k_beta <- 64 #knots for functional effect including random effects - try 48-64
ctrl <- mgcv::gam.control(trace = TRUE) # verbose=1


mod_fx <- pffr(Y_lp1 ~ c(age5_centered) + s(sub_code, bs="re"),
               data = scalar_df,
               yind = y_ix, #Vector of time index, from 0 to 24 hours
               bs.yindex = list(bs="cp", k=k_beta, m=c(2,1)), #k cubic P-splines
               bs.int = list(bs = "cp", k=k_int, m=c(2,1)),
               algorithm = "bam",
               method = "fREML", #fast REML approximation
               control = ctrl,
               discrete = TRUE) #use discrete approximation for speed
## Setup complete. Calling fit
## Fit complete. Finishing gam object.
##           user.self sys.self elapsed
## initial        0.19     0.02    0.21
## gam.setup     20.29     0.17   20.48
## pre-fit        0.00     0.00    0.00
## fit         1041.77     8.19 1050.78
## finalise       0.01     0.00    0.02
summary(mod_fx, re.test = FALSE) #re.test slows down summary, not needed here
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Y_lp1 ~ c(age5_centered) + s(sub_code, bs = "re")
## 
## Constant coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.61230    0.01916  31.958  < 2e-16 ***
## age5_centered -0.04612    0.01657  -2.783  0.00538 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Smooth terms & functional coefficients:
##      edf Ref.df F p-value
## 
## R-sq.(adj) =   0.63   Deviance explained = 64.6%
## fREML score =  63022  Scale est. = 0.29327   n = 72000 (50 x 1440)
plot(mod_fx, pages=1)

#Hack way to plot results
pred_df <- predict(mod_fx) %>% as.data.frame() %>%
  pivot_longer(everything(), names_prefix  = "V", values_to = "yhat", 
               names_to = "sample") %>%
  mutate(sample = as.numeric(sample),
         hours = (sample-1)/1439*24) %>% #care, off by one
  mutate(yhat_raw = exp(yhat) - 1) %>%
  mutate(sub_code = rep(scalar_df$sub_code, each = 1440))

plot_pred_df <- pred_df %>% 
  filter(sub_code %in% plt_subs)

day_df %>%
  filter(sub_code %in% plt_subs) %>%
  ggplot(aes(x=hours, y=MIMS_UNIT, group = sub_code)) + 
  geom_ribbon(aes(ymax = MIMS_UNIT, ymin=0), fill = "navy", alpha = 0.2) + 
  geom_line(linewidth = 0.25, color="navy", alpha = 0.4) + 
  geom_line(aes(y=yhat_raw), data = plot_pred_df, 
            color = "red", linewidth = 1, alpha = 0.6) + 
  facet_wrap(~sub_code) + 
  scale_x_continuous(limits=c(0,24), breaks = hour_seq, expand = c(0.02,0)) 

View on linear predictor scale (log(x+1))

day_df %>%
  mutate(log_p1_mims = log(MIMS_UNIT + 1)) %>%
  filter(sub_code %in% plt_subs) %>%
  ggplot(aes(x=hours, y=log_p1_mims, group = sub_code)) + 
  geom_ribbon(aes(ymax = log_p1_mims, ymin=0), fill = "navy", alpha = 0.2) + 
  geom_line(linewidth = 0.25, color="navy", alpha = 0.4) + 
  geom_line(aes(y=yhat), data = plot_pred_df, 
            color = "red", linewidth = 1, alpha = 0.6) + 
  facet_wrap(~sub_code) + 
  scale_x_continuous(limits=c(0,24), breaks = hour_seq, expand = c(0.02,0)) 

Viewing functional model:

#Hacking the underlying GAM to get raw predictions
#y_ix.vec name is a consequence of passing variable y_ix earlier. 
#If we used t, it would be called t.vec
int_df <- data.frame(y_ix.vec = seq(0,24,length.out = 1440)) %>%
  mutate(sub_code = factor("CO002"),
         age5_centered = 1) 
#Will ignore subject when constructing intercept, needed to dodge predict warning
      
#iterms to include intercept uncertainty
beta_preds <- mgcv::predict.gam(mod_fx, type="iterms", 
                                newdata = int_df, se.fit = TRUE)

#Then do response for yhat on lp1 scale
smooth_select <- paste("s(y_ix.vec):", "c(age5_centered)", sep="")
smooth_select <- "age5_centered" #if c(age5_centered)

y_constant <- attr(beta_preds, "constant") 
y_functional <- beta_preds$fit[,"s(y_ix.vec)"]
y_functional_se <- beta_preds$se.fit[,"s(y_ix.vec)"]
beta_t <- beta_preds$fit[,smooth_select]

#LP - linear predictor + 95% CIs
y_LP <- y_constant + y_functional
y_LP_low <- y_constant + y_functional - 1.96*y_functional_se
y_LP_hi <- y_constant + y_functional + 1.96*y_functional_se
#Also hack-ish, would be better with tidyfun:: methods but would introduce more dependencies
beta_df <- data.frame(hours = seq(0,24,length.out=1440),
                      y_LP_65 = y_constant + y_functional + -2*beta_t,
                      y_LP_70 = y_constant + y_functional + -1*beta_t,
                      y_LP_75 = y_constant + y_functional + 0*beta_t,
                      y_LP_80 = y_constant + y_functional + 1*beta_t,
                      y_LP_85 = y_constant + y_functional + 2*beta_t,
                      y_LP = y_constant + y_functional,
                      y_LP_low = y_constant + y_functional - 1.96*y_functional_se,
                      y_LP_hi = y_constant + y_functional + 1.96*y_functional_se) %>%
  mutate(yhat_raw = exp(y_LP) - 1,
         yhat_raw_low = exp(y_LP_low) - 1,
         yhat_raw_hi = exp(y_LP_hi) - 1
         ) %>%
  mutate(y_65 = exp(y_LP_65) - 1,
         y_70 = exp(y_LP_70) - 1,
         y_75 = exp(y_LP_75) - 1,
         y_80 = exp(y_LP_80) - 1,
         y_85 = exp(y_LP_85) - 1)
                      
beta_df %>%
  ggplot(aes(x=hours, y=yhat_raw)) + 
  geom_line(color = "navy") + 
  geom_ribbon(aes(ymin = yhat_raw_low, ymax = yhat_raw_hi),
              fill = "navy", alpha = 0.2) + 
  scale_x_continuous(limits=c(0,24), breaks = hour_seq, expand = c(0.01,0),
                     labels = formatted_hours,
                     name = 'Time of day') +
  ggtitle("Estimated average daily activity pattern for a 75 yr old") 

Plotting age estimates:

lwd <- 2
age_col <-viridis(5, direction = 1) #5 colors for 65-85

beta_df %>%
  ggplot(aes(x=hours, y=yhat_raw)) + 
  geom_line(aes(y=y_65), color = age_col[1], linewidth = lwd) +
  geom_line(aes(y=y_70), color = age_col[2], linewidth = lwd) +
  geom_line(aes(y=y_75), color = age_col[3], linewidth = lwd) + 
  geom_line(aes(y=y_80), color = age_col[4], linewidth = lwd) + 
  geom_line(aes(y=y_85), color = age_col[5], linewidth = lwd)  + 
  geom_point(aes(color = hours), alpha = 0) + 
  scale_color_viridis_c(option="viridis", limits = c(65,85),
                        breaks = c(65, 70, 75, 80, 85), 
                        labels = c('65', '70', '75', '80', '85')) + 
  guides(color = guide_colorbar(title = "Age (yrs)")) + 
  scale_x_continuous(limits=c(0,24), breaks = hour_seq, expand = c(0.01,0),
                     labels = formatted_hours,
                     name = 'Time of day') +
  ggtitle("Estimated effect of age on 24-hr activity") + 
  theme(legend.position = "bottom") 

Avenues for improvement

Some residual diagnosics

We can check how “good” our iid assumption is by plotting the residual correlation matrix. A perfectly-specified pffr model should be an identity matrix (all off-diagonal correlations = 0)

I suspect the worse-ish autocorrelation in the upper left (remember, that’s midnight to ~5am) is from some, but not all, subjects taking the device off at night. As above, that might be addressed by a cleverer transformation than log(x+1).

resid_E = residuals(mod_fx)
R_E <- cor(resid_E)

melted_R <- melt(R_E)
ggplot(data = melted_R, aes(x=Var2, y=Var1, fill=value)) + 
  geom_raster() + 
  scale_x_continuous(expand = c(0,0)) + 
  scale_y_reverse(expand = c(0,0)) + 
  scale_fill_gradientn(colors = c("#e41a1c", "white", "#377eb8"),
                         limits = c(-1, 1),
                         breaks = c(-1, 0, 1))  + 
  ggtitle('Residual autocorrelation in original model')

Fit an intentionally-misspecified model and compare

If we skip the subject/curve-specific random effects (one in the same since we have one day from each subject) we misspecify the model and should have a lot more residual autocorrelation. Do we?

mod_mis <- pffr(Y_lp1 ~ c(age5_centered),
               data = scalar_df,
               yind = y_ix, #Vector of time index, from 0 to 24 hours
               bs.yindex = list(bs="cp", k=k_beta, m=c(2,1)), #k cubic P-splines
               bs.int = list(bs = "cp", k=k_int, m=c(2,1)),
               algorithm = "bam",
               method = "fREML", #fast REML approximation
               control = ctrl,
               discrete = TRUE) #use discrete approximation for speed
## Setup complete. Calling fit
## Fit complete. Finishing gam object.
##           user.self sys.self elapsed
## initial        0.03        0    0.03
## gam.setup      0.00        0    0.00
## pre-fit        0.00        0    0.00
## fit            0.02        0    0.01
## finalise       0.01        0    0.02
#summary(mod_mis, re.test = FALSE)
#plot(mod_mis, pages=1)

resid_M = residuals(mod_mis)
R_M <- cor(resid_M)

melted_RM <- melt(R_M)


ggplot(data = melted_RM, aes(x=Var2, y=Var1, fill=value)) + 
  geom_raster() + 
  scale_x_continuous(expand = c(0,0)) + 
  scale_y_reverse(expand = c(0,0)) + 
  scale_fill_gradientn(colors = c("#e41a1c", "white", "#377eb8"),
                       limits = c(-1, 1),
                       breaks = c(-1, 0, 1)) + 
  ggtitle('Residual autocorrelation in misspecified model')

Much worse!